20 research outputs found

    A highly scalable parallel implementation of H.264

    Get PDF
    Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required to develop a subscription mechanism, where MBs are subscribed to the kick-off lists associated with the reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, in which case the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented. The results show that these policies combat memory and latency issues with a negligible effect on the performance scalability. Results analyzing the impact of the memory latency, L1 cache size, and the synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding. This work was performed while the fourth author was with NXP Semiconductors.Peer ReviewedPostprint (author's final draft

    A Look-Ahead Task Management Unit for Embedded Multi-Core Architectures

    No full text
    Efficient utilization of multi-core architectures relies on the partitioning of applications into tasks and mapping the tasks to cores. In some applications (e.g. H.264 video decoding parallelized at macro-block level) these tasks have dependencies among each other. Task scheduling, consisting of selecting a task with satisfied dependencies and mapping it to a core, is typically a functionality delegated to the Operating System. In this paper we present a hardware Task Management Unit (TMU) that looks ahead in time to find tasks to be executed by a multi-core architecture. The look-ahead functionality is shown to reduce the task management overhead by 40-50\% when executing a parallelized version of an H.264 video decoder on an architecture with up to 16 cores. In overall, the TMU-based multi-core architecture reaches a speedup of more than 14x on 16 cores running H.264 video decoding, assuming CABAC is implemented in a dedicated coprocessor

    A Formally Verified Fail-Operational Safety Concept for Automated Driving

    Get PDF
    Modern Automated Driving (AD) systems rely on safety measures to handle faults and to bring the vehicle to a safe state. To eradicate lethal road accidents, car manufacturers are constantly introducing new perception as well as control systems. Contemporary automotive design and safety engineering best practices are suitable for analyzing system components in isolation, whereas today's highly complex and interdependent AD systems require a novel approach to ensure resilience to multiple-point failures. We present a holistic and cost-effective safety concept unifying advanced safety measures for handling multiple-point faults. Our proposed approach enables designers to focus on more pressing issues such as handling fault-free hazardous behavior associated with system performance limitations. To verify our approach, we developed an executable model of the safety concept in the formal specification language mCRL2. The model behavior is governed by a four-mode degradation policy-controlling distributed processors, redundant communication networks, and virtual machines (VMs). To keep the vehicle as safe and cost effective as possible, our degradation policy can reduce driving comfort or AD system's availability using additional low-cost driving channels. We formalized five safety requirements in the modal μ-calculus and proved them against our mCRL2 model, which is intractable to accomplish exhaustively using traditional road tests or simulation techniques. In conclusion, our formally proven safety concept defines a holistic and cost-effective design pattern for AD systems

    communication models for clustered VLIW processors

    No full text
    Clustering is a well-known technique to improve the implementation of single register file VLIW processors. Many previous studies in clustering adhere to an intercluster communication means in the form of copy operations. This paper, however, identifies and evaluates five different inter-cluster communication models, including copy operations, dedicated issue slots, extended operands, extended results, and broadcasting. Our study reveals that these models have a major impact on performance and implementation of the clustered VLIW. We found that copy operations executed in regular VLIW issue slots significantly constrain the scheduling freedom of regular operations. For example, in the dense code for our four cluster machine the total cycle count overhead reached 46.8% with respect to the unicluster architecture, 56 % of which are caused by the copy operation constraint. Therefore, we propose to use other models (e.g. extended results or broadcasting), which deliver higher performance than the copy operation model at the same hardware cost.

    A retargetable fault injection framework for safety validation of autonomous vehicles

    No full text
    Autonomous vehicles use Electronic Control Units running complex software to improve passenger comfort and safety. To test safety of in-vehicle electronics, the ISO 26262 standard on functional safety recommends using fault injection during component and system-level design. A Fault Injection Framework (FIF) induces hard-to-trigger hardware and software faults at runtime, enabling analysis of fault propagation effects. The growing number and complexity of diverse interacting components in vehicles demands a versatile FIF at the vehicle level. In this paper, we present a novel retargetable FIF based on debugger interfaces available on many target systems. We validated our FIF in three Hardware-In-the-Loop setups for autonomous driving based on the NXP BlueBox prototyping platform. To trigger a fault injection process, we developed an interactive user interface based on Robot Operating System, which also visualized vehicle system health. Our retargetable debugger-based fault injection mechanism confirmed safety properties and identified safety shortcomings of various automotive systems

    A retargetable fault injection framework for safety validation of autonomous vehicles

    No full text
    \u3cp\u3eAutonomous vehicles use Electronic Control Units running complex software to improve passenger comfort and safety. To test safety of in-vehicle electronics, the ISO 26262 standard on functional safety recommends using fault injection during component and system-level design. A Fault Injection Framework (FIF) induces hard-to-trigger hardware and software faults at runtime, enabling analysis of fault propagation effects. The growing number and complexity of diverse interacting components in vehicles demands a versatile FIF at the vehicle level. In this paper, we present a novel retargetable FIF based on debugger interfaces available on many target systems. We validated our FIF in three Hardware-In-the-Loop setups for autonomous driving based on the NXP BlueBox prototyping platform. To trigger a fault injection process, we developed an interactive user interface based on Robot Operating System, which also visualized vehicle system health. Our retargetable debugger-based fault injection mechanism confirmed safety properties and identified safety shortcomings of various automotive systems.\u3c/p\u3

    Parallel H.264 decoding on an embedded multicore processor

    No full text
    In previous work the 3D-Wave parallelization strategy was proposed to increase the parallel scalability of H.264 video decoding. This strategy is based on the observation that inter-frame dependencies have a limited spatial range. The previous results, however, investigate application scalability on an idealized multiprocessor. This work presents an implementation of the 3D-Wave strategy on a multicore architecture composed of NXP TriMedia TM3270 embedded processors. The results show that the parallel H.264 implementation scales very well, achieving a speedup of more than 54 on a 64-core processor. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the latencies of some frames might increase. To address these drawbacks, policies to reduce the number of frames in flight and the frame latency are also presented. The results show that our policies combat memory and latency issues with a negligible effect on the performance scalability.Peer Reviewe
    corecore